The development of modern unmanned systems places increasingly high demands on controller performance. These systems are typically constrained, and constraint violations may lead to unsafe behaviors that severely affect system operation. Model predictive control (MPC) provides an optimal solution while handling constraints explicitly, and it has seen significant success in recent decades with wide applications in diverse fields [1], [2], [3], [4], [5], [6].
The successful applications of MPC have attracted tremendous academic interest, and a rigorous stability-centered theoretical foundation for MPC was soon established. Early stability results on MPC employ a zero-state terminal equality constraint [7], [8]. These were subsequently extended to a terminal inequality constraint by taking a control Lyapunov function as the terminal cost [9], [10], [11], which established the currently well-known stability framework. Based on this framework, a commonly adopted MPC design paradigm involves finding an appropriate terminal cost, a terminal controller, and a terminal set that strictly satisfy specific conditions. While this is relatively easy for linear systems, it is generally challenging for nonlinear ones. Even if a qualified terminal cost, terminal controller, and corresponding terminal set are found, the conditions for stability and recursive feasibility are usually quite conservative. In practice, moreover, the theoretically well-established MPC approaches are often prohibitively difficult to apply due to their high computational burden. Instead, MPC without terminal cost and terminal constraint enjoys wide popularity. Its closed-loop stability and recursive feasibility can be guaranteed under some conditions [12], [13]. However, the absence of a terminal cost means that the residual costs beyond the truncated prediction horizon are completely ignored, which makes it difficult to achieve optimal performance under a very limited prediction horizon. Motivated by the analysis above, we aim to develop a learning-based MPC scheme that learns an appropriate terminal cost function without complex offline design procedures while improving controller performance.
In recent years, the striking progress of reinforcement learning (RL), exemplified by AlphaGo [14] and Atari games [15], has drawn the attention of the control community. Unlike MPC's offline design, RL optimizes the control policy through online data-based adaptation [16], [17]. Observed state transitions and costs (or rewards) are the only inputs to the RL agent, and no prior knowledge of the system dynamics is needed [18], [19]. A central idea in RL is temporal-difference (TD) learning [20], [21], [22], which estimates the value function directly from raw experience in a bootstrapping manner, without waiting for a final outcome. The value function is a prediction of the expected long-term reward at each state. When optimality is reached, it encodes globally optimal information so that the infinite-horizon optimal control policy can be obtained.
Nevertheless, to approximate the globally optimal policy, the RL agent tends to try different policies and learn via trial and error, and it therefore struggles to provide safety guarantees on the resulting behaviors. This problem becomes prominent for safety-critical systems, where safety is defined in terms of stability. Some achievements centered on safe RL have appeared recently [23], [24], [25], but it remains largely an open field.
In view of the powerful data-driven optimization capability of RL, combining it with MPC to optimize the controller design with running data can be expected to improve control performance. At the same time, benefiting from the solid theoretical foundation of MPC, the properties of the behaviors produced by the RL agent become easier to analyze. This motivates attempts to integrate the best of both worlds. However, only limited pioneering work has been reported in this field. For example, in [26], a novel “plan online and learn offline” framework was proposed by taking MPC as the trajectory optimizer of RL. It is shown that MPC enables more efficient exploration, thereby accelerating convergence and reducing approximation errors in the value function. The related work was extended to the actor-critic (AC) framework in [27] to handle sparse and binary rewards, showing that MPC is an important sampler that effectively propagates global information. These results were further applied in [28] to tasks defined by simple and easily interpretable high-level objectives on a real unmanned ground vehicle, and the expression of the value function was improved. Yet the focus of these works is on the improvements that MPC brings to RL practice, with no reference to stability. The stability issue was highlighted in [29], where an economic MPC was used as a function approximator in RL. This scheme was further analyzed in [30] under an improved algorithmic framework. These two papers paved the way for [31], where the constraint-satisfaction concern was addressed with the help of robust linear MPC. The practicality of these approaches, however, remains to be investigated and verified. To the best of the authors' knowledge, few studies balance the theoretical and practical guarantees of “RL + MPC” approaches, that is, provide a theoretically safe and operationally efficient scheme; this article aims to fill this gap.
We propose an “RL + MPC” scheme from a new perspective: policy iteration (PI). Since MPC's ability to handle constraints can provide constraint-enforcing policies for RL agents, safety in this article refers not only to stability but also to constraint satisfaction. The main contributions of this article are threefold.
We provide a new idea for combining RL and MPC by bridging them through PI. In this way, a complete RL-based model predictive control (RLMPC) scheme is developed for discrete-time systems. In each iteration, the current policy is evaluated through learning to obtain its value function, which is then employed as the terminal cost to compensate for the suboptimality induced by the truncated prediction horizon of MPC, thus improving the policy. This avoids complex offline design procedures while progressively improving performance toward the (near-)optimum.
The convergence of learning, recursive feasibility, and closed-loop stability of the proposed RLMPC are closely investigated, thereby theoretically guaranteeing its safety. We demonstrate that no constraint is violated even before the optimal value function is well-learned, which verifies the ability of MPC to effectively constrain the RL-produced policies within safe limits.
We incorporate the value function approximation (VFA) technique into the developed RLMPC approach to approximate the global optimal value function with high computational efficiency. The effect of the prediction horizon on the control performance is scrutinized. Results show that the proposed RLMPC scheme can achieve nearly the same optimal performance as traditional MPC in linear system control and outperform it in nonlinear cases, which is due to the conservativeness of the offline MPC design for nonlinear systems. In addition, thanks to the elimination of the terminal controller and terminal constraint, the RLMPC scheme is particularly useful in dealing with systems where it is difficult or even impossible to design a terminal controller, such as nonholonomic systems, while reducing computational burden by flexibly adjusting its prediction horizon.
The rest of this article is organized as follows. In Section II, we formally outline the problem formulation. The RLMPC scheme is developed in Section III. Its convergence, optimality, feasibility, and stability properties are analyzed in Section IV. Section V discusses the implementation of the overall scheme. We test the proposed scheme in simulations of both linear and nonlinear examples in Section VI. Section VII concludes this article.
Notation: We use the following convention: $\mathbb{R}$ denotes the set of reals and $\mathbb{R}^n$ is the $n$-dimensional real vector space. $\mathbb{N}$ is the set of natural numbers. For some $r_1, r_2 \in \mathbb{R}$, we use $\mathbb{R}_{\geq r_1}$, $\mathbb{N}_{>r_2}$, and $\mathbb{N}_{[r_1,r_2]}$ to represent the sets $\{r \in \mathbb{R} \,|\, r \geq r_1\}$, $\{r \in \mathbb{N} \,|\, r > r_2\}$, and $\{r \in \mathbb{N} \,|\, r_1 \leq r \leq r_2\}$, respectively. We label the variables of the optimal solution with $(\cdot)^{\ast}$, feasible solutions with $\tilde{(\cdot)}$, and estimated ones with $\hat{(\cdot)}$, respectively. Moreover, the notations $x_{i|k}$ and $u_{i|k}$ indicate the state and input predicted $i$ steps ahead from the current time $k$, respectively. The sequence $\{u_{0|k}, u_{1|k}, \ldots, u_{N-1|k}\}$ is denoted by $\boldsymbol{u}_k$, or by $\boldsymbol{u}_{N\|k}$ if we want to emphasize its length $N$.
SECTION II.
Problem Formulation
A. Optimal Control Problem
Consider a dynamic system described by the following state-space difference equation:
\begin{equation*} x_{k+1} = f(x_{k},u_{k}) \tag{1}\end{equation*}
where $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ are the system state and control input at time $k \in \mathbb{N}$, respectively. It is assumed that the system is subject to the constraints
\begin{equation*} x_{k} \in \mathbb{X}, \quad u_{k} \in \mathbb{U} \tag{2}\end{equation*}
where $\mathbb{X}$ and $\mathbb{U}$ are compact sets containing the origin as an interior point. Also, system (1) is assumed to satisfy the following conditions.
Assumption 1:
The function $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is continuous with $f(0,0) = 0$. Under the constraints (2), the system (1) is stabilizable with respect to the equilibrium at $x_k = 0$, $u_k = 0$, and the system state $x_k$ is measurable.
Assumption 1 is a nonintrusive and common assumption in the MPC community and can be found in numerous works (e.g., [9], [10], [11], [12], [13]). For a system $\breve{x}_{k+1} = f(\breve{x}_k, \breve{u}_k)$ with a general equilibrium $(x_e, u_e)$, one can always apply the change of variables $x_k = \breve{x}_k - x_e$ and $u_k = \breve{u}_k - u_e$ to translate it into system (1) with the equilibrium $(0,0)$.
For the particular setup considered above, the infinite-horizon optimal control problem at time $k$ with the initial state $x_k \in \mathbb{X}$ is then described by Problem 1.

Problem 1:
Find a control policy $u_k = \pi(x_k)$, defined as a map from state space to control space $\pi: \mathbb{R}^n \to \mathbb{R}^m$ and resulting in a control sequence $\boldsymbol{u}_\infty = \{u_k, u_{k+1}, \ldots\}$, such that it stabilizes the zero equilibrium of (1) and minimizes the infinite-horizon cost
\begin{equation*} J_{\infty }(x_{k}, \boldsymbol {u}_{\infty}) = \sum _{i=k}^{\infty } \ell (x_{i}, u_{i}) \tag{3}\end{equation*}
subject to constraints (1) and (2), with the running cost $\ell: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}_{\geq 0}$ satisfying
\begin{align*} \ell (0, 0)=&0 \\ \alpha _{1}^{\mathcal {K}_{\infty} }(\|x \|)\leq&\check {\ell }(x) \triangleq \inf _{u \in \mathbb {U}} \ell (x,u) \leq \alpha _{2}^{\mathcal {K}_{\infty }}(\|x \|) \tag{4}\end{align*}
for all $x \in \mathbb{X}$, where $\alpha_1^{\mathcal{K}_\infty}$ and $\alpha_2^{\mathcal{K}_\infty}$ are $\mathcal{K}_\infty$ functions [32].
The running cost defined by (4) takes a fairly general form and is common in many MPC studies (e.g., [13], [33]). Its specific form is task-dependent; for example, it usually takes a quadratic form in regulation problems, $\ell(x,u) = x^{\mathrm{T}} Q x + u^{\mathrm{T}} R u$, where $Q$ and $R$ are positive-definite matrices of proper dimensions.
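As a concrete illustration, the quadratic running cost above can be sketched as follows (a minimal sketch; the weights $Q$, $R$ and the dimensions are our own illustrative choices, not from the article):

```python
# Minimal sketch of a quadratic running cost l(x, u) = x'Qx + u'Ru.
# Q, R, and the 2-D state / 1-D input dimensions below are illustrative.
def running_cost(x, u, Q, R):
    """Quadratic stage cost; positive-definite Q, R give l(0,0)=0 and l>0 elsewhere."""
    xQx = sum(x[i] * sum(Q[i][j] * x[j] for j in range(len(x))) for i in range(len(x)))
    uRu = sum(u[i] * sum(R[i][j] * u[j] for j in range(len(u))) for i in range(len(u)))
    return xQx + uRu

Q = [[1.0, 0.0], [0.0, 1.0]]  # state weight (identity, illustrative)
R = [[0.1]]                   # input weight
print(running_cost([1.0, 2.0], [0.5], Q, R))  # 1 + 4 + 0.1*0.25 = 5.025
```

Such a cost clearly satisfies (4), since positive-definite $Q$ and $R$ yield $\mathcal{K}_\infty$ bounds on $\check{\ell}(x)$.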
B. Background of Model Predictive Control
Generally, there is no analytic solution to Problem 1, especially for constrained nonlinear systems. An efficient method to solve this problem is MPC [9], which provides an approximate solution by recursively solving the following OP 1 at each $x_k \in \mathbb{X}$.
OP 1: Find the optimal control sequence $\boldsymbol{u}_k^{\ast}$ by solving the finite-horizon minimization problem
\begin{align*} &\min _{ \boldsymbol {u}_{k}}\, J(x_{k}, \boldsymbol {u}_{k})=\sum _{i=0}^{N-1} \ell (x_{i|k}, u_{i|k}) + F(x_{N|k}) \tag{5a}\\ &\text {s.t.}\quad x_{0|k} = x_{k} \tag{5b}\\ &\hphantom {\text {s.t.}\quad } x_{i+1|k} = f(x_{i|k},u_{i|k}),\quad 0\le i\le N - 1 \tag{5c}\\ &\hphantom {\text {s.t.}\quad } u_{i|k} \in \mathbb {U}, \quad 0\le i\le N-1 \tag{5d}\\ &\hphantom {\text {s.t.}\quad } x_{i|k}\in \mathbb {X}, \quad 1\le i\le N-1 \tag{5e}\\ &\hphantom {\text {s.t.}\quad } x_{N|k} \in \mathbb {X}_{\Omega} \tag{5f}\end{align*}
where $N \in \mathbb{N}_{>0}$ is the prediction horizon, $F(\cdot)$ is the terminal cost, $\boldsymbol{u}_k = \{u_{0|k}, u_{1|k}, \ldots, u_{N-1|k}\}$ is the predicted control sequence, $\mathbb{X}_\Omega$ is the terminal set, and (5f) is the terminal constraint.
After solving OP 1, apply only the first element of $\boldsymbol{u}_k^{\ast}$, that is, $u_k = u_{0|k}^{\ast}$, to system (1) and repeat this procedure.
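A minimal sketch of this receding-horizon procedure for an unconstrained scalar linear-quadratic instance follows; the dynamics, weights, and the backward dynamic-programming (Riccati) solver are illustrative assumptions, not the article's design:

```python
# Receding-horizon MPC sketch for the scalar system x+ = a*x + b*u with
# running cost q*x^2 + r*u^2 and terminal cost p_term*x^2.  Without state
# or input constraints, the finite-horizon problem is solved exactly by a
# backward Riccati recursion at each control period.
def solve_finite_horizon(x, a, b, q, r, p_term, N):
    """Return the optimal first input u_{0|k}* for horizon N."""
    P = p_term                       # cost-to-go weight, seeded by the terminal cost
    gains = []
    for _ in range(N):               # backward recursion i = N-1, ..., 0
        K = a * b * P / (r + b * b * P)
        gains.append(K)
        P = q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)
    return -gains[-1] * x            # first predicted input u_{0|k}* = -K_0 * x

a, b, q, r = 1.2, 1.0, 1.0, 1.0      # open-loop unstable but stabilizable (illustrative)
x = 5.0
for k in range(30):                  # closed loop: apply u_{0|k}*, measure, re-solve
    u = solve_finite_horizon(x, a, b, q, r, p_term=0.0, N=8)
    x = a * x + b * u
print(abs(x))                        # the state is regulated toward the origin
```

Only the first predicted input is applied at each step, after which the problem is re-solved from the newly measured state, exactly as described above.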
To guarantee recursive feasibility and stability, it is required that $N$ be chosen properly so that $x_{N|k}$ is guaranteed to enter a positively invariant set $\mathbb{X}_\Omega$ (under a terminal controller $\kappa(x_k) \in \mathbb{U}$) containing the origin. Moreover, the parameters of the running and terminal costs should be designed to satisfy $F(x_{N|k}) \geq \sum_{i=N}^{\infty} \ell(x_{i|k}, \kappa(x_{i|k}))$ for all $x_k \in \mathbb{X}_\Omega$, such that $F(\cdot)$ is a local Lyapunov function for $\kappa(x_k)$ in $\mathbb{X}_\Omega$.
As mentioned in Section I, it is generally a difficult task to design such terminal conditions (i.e., terminal cost, terminal constraint, and terminal controller) for nonlinear systems. Although MPC without terminal condition (MPCWTC) is proposed as an option to circumvent this challenge, its performance under a very limited (but necessary) prediction horizon tends to be inferior to that of the kind with terminal cost. Therefore, we propose to introduce RL into MPC to learn the terminal cost online, thereby generating a policy that is closer to the optimal one in the sense of an infinite horizon.
C. Background of RL
An important class of RL techniques aims to learn the optimal policy $\pi^{\ast}(x)$ by iteratively obtaining an evaluation of the current policy and then improving the policy accordingly [34]. At state $x_k$, the value function $V_\pi: \mathbb{R}^n \to \mathbb{R}$, or the evaluation, for a given policy $u_k = \pi(x_k)$ is the accumulated reward (or running cost) obtained by following the policy from $x_k$, that is, $V_\pi(x_k) = \sum_{i=k}^{\infty} \ell(x_i, u_i)$, which can be rearranged into the well-known Bellman equation
\begin{equation*} V_{\pi} (x_{k}) = \ell (x_{k}, u_{k}) + V_{\pi} (x_{k+1}). \tag{6}\end{equation*}
According to Bellman's principle, when optimality is achieved, the optimal value function $V_\pi^{\ast}(x_k)$ satisfies the Bellman optimality equation [35]
\begin{equation*} V_{\pi }^{\ast} \left ({x_{k}}\right) = \min _{u_{k}} \left \{{ \ell \left ({x_{k}, u_{k}}\right) + V_{\pi }^{\ast}\left ({x_{k+1}}\right) }\right \} \tag{7}\end{equation*}
and the corresponding optimal policy is given by
\begin{equation*} \pi ^{\ast} \left ({x_{k}}\right) = \arg \min _{u_{k}} \left \{{\ell \left ({x_{k}, u_{k}}\right) + V_{\pi }^{\ast}\left ({x_{k+1}}\right)}\right \}. \tag{8}\end{equation*}
Nevertheless, the calculation of $V_\pi(x_k)$ is generally computationally difficult, and $V_\pi^{\ast}(x_k)$ is unknown before all policies $\pi(x_k) = u_k \in \mathbb{U}$ have been considered. This motivates the TD learning approach [36], which provides a more tractable computation. In TD learning, the running costs are observed to construct the TD target $V_{\mathrm{TD}}(x_k)$, thereby iteratively updating the value function as
\begin{equation*} V_{\pi} (x_{k}) \leftarrow V_{\pi} (x_{k}) + \alpha \left [{V_{\mathrm{ TD}}(x_{k}) - V_{\pi} (x_{k}) }\right] \tag{9}\end{equation*}
where $\alpha \in (0,1)$ is a constant called the learning step size, and $V_{\mathrm{TD}}(x_k) - V_\pi(x_k)$ is known as the TD error. The Bellman equation (6) holds whenever the TD error is zero, thus yielding the evaluation of policy $\pi(x_k)$. Here, to be consistent with MPC, the TD target can be constructed in the $N$-step form
\begin{equation*} V_{\mathrm{ TD}}(x_{k}) = \sum _{i = k}^{k+N-1} \ell (x_{i}, u_{i}) + V_{\pi} (x_{k+N}). \tag{10}\end{equation*}
It can be seen that this TD target has a similar form to the MPC cost function (5a), which builds the connection between MPC and RL; their combination gives rise to the RLMPC algorithm presented in Sections III–V.
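The $N$-step TD update (9) with target (10) can be illustrated on a toy deterministic chain (a hypothetical example; the states, dynamics, and costs below are our own illustrative choices, not from the article):

```python
# Tabular N-step TD evaluation on a deterministic chain: state x in {0,1,2,3},
# dynamics x -> max(x-1, 0), running cost l(x) = x.  The true value function
# is V(x) = x + (x-1) + ... + 1 = x*(x+1)/2.
N, alpha = 2, 0.2
V = [0.0] * 4                                  # tabular value estimates

def step(x):                                   # illustrative dynamics
    return max(x - 1, 0)

for sweep in range(500):                       # repeated sweeps over all states
    for x in range(4):
        # N-step TD target (10): N observed running costs plus bootstrapped tail
        target, s = 0.0, x
        for _ in range(N):
            target += s                        # l(s) = s
            s = step(s)
        target += V[s]
        V[x] += alpha * (target - V[x])        # TD update (9)

print([round(v, 3) for v in V])                # approaches [0, 1, 3, 6]
```

The estimates converge to the true accumulated costs, e.g., $V(3) = 3 + 2 + 1 = 6$, without ever waiting for the full infinite-horizon outcome.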
In the proposed RLMPC, the value function and control policy are updated in the PI manner. Specifically, we take the currently available heuristic term as the terminal cost of MPC, and the corresponding control input defines the current control policy. The costs incurred by following this policy, which can be regarded as the rewards for the controller interacting with the system dynamics, serve as training data for learning the value function. The VFA technique is employed to approximate the underlying value function, that is, the evaluation of the current policy. The latest learned value function is then taken as the heuristic term in the cost function, thus updating the MPC controller, that is, the current policy. Fig. 1 illustrates the RLMPC pipeline.
A. Policy Generator—MPC
In this section, we formulate the MPC problem to be solved at each step (see OP 2). Since MPC only looks $N$ steps forward at each control period, it ultimately produces a locally optimal policy for Problem 1 unless coupled with a terminal cost that propagates global information [26]. Motivated by this, we take the heuristic term learned through RL, namely the value function, as the terminal cost. Based on the results presented in [13], we modify the formulation of MPC as follows.
OP 2: Find the optimizer $\boldsymbol{u}_k$ of the following problem:
\begin{align*} &\min _{ \boldsymbol {u}_{k}}\, J(x_{k}, \boldsymbol {u}_{k})= \sum _{i=0}^{N-1} \ell (x_{i|k}, u_{i|k}) + V_{\pi} (x_{N|k}) \tag{11a}\\ &\text {s.t.}\quad x_{0|k} = x_{k} \tag{11b}\\ &\hphantom {\text {s.t.}\quad } x_{i+1|k} = f(x_{i|k},u_{i|k}),\quad 0\le i\le N - 1 \tag{11c}\\ &\hphantom {\text {s.t.}\quad } u_{i|k} \in \mathbb {U},\quad 0\le i\le N-1 \tag{11d}\\ &\hphantom {\text {s.t.}\quad } x_{i|k}\in \mathbb {X},\quad 1\le i\le N \tag{11e}\end{align*}
where the control sequence $\boldsymbol{u}_k = \{u_{0|k}, u_{1|k}, \ldots, u_{N-1|k}\}$ is the decision variable, and $V_\pi(\cdot)$ is a heuristic term obtained through the learning technique to be presented in Section III-B. We define the MPC control policy generated under a fixed $V_\pi(\cdot)$ to be a fixed policy.
Up to this point, we obtain the policy $\pi(x_k)$ implicitly defined by OP 2: solving OP 2 at time $k$ yields a control sequence $\boldsymbol{u}_k^{\ast} = \{u_{0|k}^{\ast}, u_{1|k}^{\ast}, \ldots, u_{N-1|k}^{\ast}\}$, and the first element is applied to system (1).
A significant advantage of this formulation is that it not only inherits the advantages of MPCWTC [13] but also has the potential to further improve control performance by compensating, through the learned heuristic term, for the costs neglected by the truncated prediction horizon.
In this article, the initial policy $\pi^0(x_k)$ is generated by solving OP 2 with $V_\pi(\cdot) \equiv 0$ and implemented in the receding-horizon fashion, which is exactly the MPCWTC. To ensure its stability and recursive feasibility, we make the following assumptions.
Assumption 2 (Controllability Assumption):
There exists a positive constant $\beta \in \mathbb{R}_{>0}$ such that
\begin{equation*} V_{\pi }^{\ast} (x_{k}) \leq \beta \cdot \check {\ell }(x_{k})\quad \forall x_{k} \in \mathbb {X} \tag{12}\end{equation*}
where $V_\pi^{\ast}(x_k)$ and $\check{\ell}(x_k)$ are defined in (7) and (4), respectively.
Lemma 1 (See [13]):
If the prediction horizon $N$ of the MPCWTC is chosen to satisfy
\begin{equation*} N > \frac { 2\ln \beta }{\ln \beta - \ln (\beta -1)} \tag{13}\end{equation*}
then the closed-loop system is guaranteed to be asymptotically stable and the MPCWTC optimization problem is recursively feasible.
Assumption 3:
In this context, the prediction horizon of the initial controller satisfies (13).

Assumption 2 is a fairly standard condition in the MPCWTC literature (e.g., [13], [37], [38]) and is also known as the boundedness condition of the optimal value function. It indicates the controllability of system (1) with respect to the equilibrium, since otherwise $V_\pi^{\ast}(x_k)$ would go to infinity [13]. Lemma 1 and Assumption 3 jointly ensure that the initial controller is stabilizing, which is consistent with the requirement on the initial policy for PI (to be introduced in Section III-C). Methods to calculate a suitable $\beta$ can be found in [39] and [40].
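Once $\beta$ is known, the smallest integer horizon satisfying the strict bound (13) follows from a direct calculation (a small illustrative computation; the sample values of $\beta$ are our own):

```python
import math

def min_horizon(beta):
    """Smallest integer N with N > 2*ln(beta) / (ln(beta) - ln(beta - 1)),
    i.e., the horizon bound (13); requires beta > 1."""
    bound = 2.0 * math.log(beta) / (math.log(beta) - math.log(beta - 1.0))
    return math.floor(bound) + 1   # strict inequality, so step past the bound

print(min_horizon(2.0))   # bound = 2*ln2 / (ln2 - ln1) = 2, so N = 3
print(min_horizon(5.0))   # a larger beta demands a longer horizon
```

As expected, the required horizon grows quickly with $\beta$: a system that is harder to control (larger $\beta$ in (12)) needs a longer prediction horizon for the MPCWTC guarantees to hold.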
B. Learning of the Value Function
Now we focus on how to perform policy evaluation for a fixed control policy $\pi(x_k)$ at time $k$. As mentioned in Section II-C, the essence of policy evaluation is to obtain $V_\pi(x_k)$ for $\pi(x_k)$ on the set $\mathbb{X}$. For a continuous state space, however, it is impossible to obtain $V_\pi(x_k)$ at every $x_k \in \mathbb{X}$, because we cannot traverse all states over the state space. Therefore, the VFA technique [36] is adopted in this article to approximate the true value function.
According to the higher-order Weierstrass approximation theorem [41], there exists a dense polynomial basis set $\{\Phi_i(x_k)\}$ with which one can approximate the true value function $V_\pi(x_k)$, $\forall x_k \in \mathbb{X}$, in the following form:
\begin{align*} V_{\pi} (x_{k})=&\sum _{i=1}^{\infty } W_{i} \Phi _{i}(x_{k}) = \sum _{i=1}^{p} W_{i} \Phi _{i}(x_{k}) + \sum _{i=p+1}^{\infty } W_{i} \Phi _{i}(x_{k}) \\=&W^{\mathrm {T}} \Phi (x_{k}) + e(x_{k}) = \hat {V}_{\pi} (x_{k}, W) + e(x_{k}) \tag{14}\end{align*}
where $\hat{V}_\pi(x_k, W)$ is an approximation of $V_\pi(x_k)$, $W = [W_1, \ldots, W_p]^{\mathrm{T}} \in \mathbb{R}^p$ is the weight vector, and $\Phi(x_k) = [\Phi_1(x_k), \ldots, \Phi_p(x_k)]^{\mathrm{T}}$ is the basis vector. The approximation error $e(x_k)$ converges uniformly to zero as $p \to \infty$. Note that this approximation is essentially a single-layer network, and all we have to do is feed training data into this network.
Consider a set containing $q$ ($q \in \mathbb{N}_{>0}$) states sampled in $\mathbb{X}$ at time $k$, denoted as $S_k = \{x_{k_1}, x_{k_2}, \ldots, x_{k_q}\} \subset \mathbb{X}$. Take each of its elements separately as an initial state and propagate one step using the current policy; the incurred costs are then recorded as training data. Specifically, for each $x_{k_j} \in S_k$, $j \in \mathbb{N}_{[1,q]}$, solve OP 2 (whose terminal cost is a fixed heuristic function corresponding to the current policy) with the subscript $k$ replaced by $k_j$, obtain the solution $\boldsymbol{u}_{k_j}^{\ast}$, and calculate the cost $J(x_{k_j}, \boldsymbol{u}_{k_j}^{\ast})$ according to (11a). Note that we do not apply any $\boldsymbol{u}_{k_j}^{\ast}$ to the real system in this process, so the new subscript $k_j$ is adopted here to distinguish it from the actual system state.
With all these costs $J(x_{k_j}, \boldsymbol{u}_{k_j}^{\ast})$, $j \in \mathbb{N}_{[1,q]}$, we are now able to train the network in a manner similar to TD learning. Recall that the purpose of (9) is to approximate the TD target with $V_\pi(x_k)$; we now directly use $\hat{V}_\pi(x_{k_j}, W)$ to approximate the targets $J(x_{k_j}, \boldsymbol{u}_{k_j}^{\ast})$
\begin{align*} \hat {V}_{\pi} (x_{k_{j}}, W) \leftarrow \hat {V}_{\pi} (x_{k_{j}}, W) + \alpha \left [{J\left ({x_{k_{j}}, \boldsymbol {u}^{\ast}_{k_{j}}}\right) - \hat {V}_{\pi} (x_{k_{j}}, W) }\right] \tag{15}\end{align*}
for all $j \in \mathbb{N}_{[1,q]}$, and we aim to minimize $E(x_{k_j}) = \frac{1}{2} e(x_{k_j})^2$, where $e(x_{k_j}) = J(x_{k_j}, \boldsymbol{u}_{k_j}^{\ast}) - \hat{V}_\pi(x_{k_j}, W)$. The stochastic gradient descent method [36] achieves this minimization by adjusting the weight vector $W$ with a small step size $\alpha \in (0,1)$ at a time in the direction of the negative gradient (with respect to $W$) of the squared error
\begin{align*} W\leftarrow&W-\frac {1}{2}\alpha \nabla _{W} \left [{J\left ({x_{k_{j}}, \boldsymbol {u}^{\ast}_{k_{j}}}\right)-\hat {V}_{\pi} \left ({x_{k_{j}}, W}\right)}\right]^{2} \\=&W+\alpha \left [{J\left ({x_{k_{j}}, \boldsymbol {u}^{\ast}_{k_{j}}}\right)-\hat {V}_{\pi} \left ({x_{k_{j}}, W}\right)}\right]\nabla _{W} \hat {V}_{\pi} \left ({x_{k_{j}}, W}\right) \\=&W+\alpha \left [{J\left ({x_{k_{j}}, \boldsymbol {u}^{\ast}_{k_{j}}}\right)- W^{\mathrm {T}} \Phi \left ({x_{k_{j}}}\right)}\right] \Phi \left ({x_{k_{j}}}\right), \quad j \in \mathbb {N}_{\left [{1,q}\right]} \tag{16}\end{align*}
in which $\nabla_W$ denotes the partial derivatives with respect to the components of $W$.
After $W$ converges for all the training data, that is, $[J(x_{k_j}, \boldsymbol{u}_{k_j}^{\ast}) - \hat{V}_\pi(x_{k_j}, W)] \to 0$ holds for all $x_{k_j} \in S_k$, we obtain the (approximate) evaluation of the given fixed control policy. Provided with sufficient training data and a proper choice of the basis vector, we have $\hat{V}_\pi(x_k, W) \to V_\pi(x_k)$, $\forall x_k \in \mathbb{X}$. It is worth noting that with VFA, we can characterize the value function over the state space using only the storage space of a $p$-dimensional vector $W$, which enhances the utility of the proposed approach.
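The SGD update (16) on a polynomial basis can be sketched as follows. This is a minimal illustration with a scalar state, a hand-picked basis, and synthetic targets; in RLMPC the targets would instead come from solving OP 2:

```python
# VFA sketch: fit V_hat(x, W) = W^T Phi(x) with basis Phi(x) = [x, x^2] to
# synthetic targets J(x) = 2*x^2 via the stochastic gradient rule (16).
def phi(x):
    return [x, x * x]                          # illustrative polynomial basis (p = 2)

def v_hat(x, W):
    return sum(w * f for w, f in zip(W, phi(x)))

W = [0.0, 0.0]                                 # weight vector, initially zero
alpha = 0.05                                   # learning step size
samples = [0.5, 1.0, 1.5, -1.0, -0.5]          # states sampled from X (the set S_k)
for epoch in range(2000):
    for x in samples:
        target = 2.0 * x * x                   # stand-in for J(x_{k_j}, u*_{k_j})
        err = target - v_hat(x, W)             # the error e(x_{k_j})
        W = [w + alpha * err * f for w, f in zip(W, phi(x))]  # update (16)

print([round(w, 3) for w in W])                # W approaches [0, 2]
```

Since the target lies exactly in the span of the basis, the approximation error vanishes and the $p$-dimensional vector $W$ fully characterizes the value function over the sampled region.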
C. PI in RLMPC
In RLMPC, the value function and the control policy are updated by iterations with index $t \in \mathbb{N}$ increasing from zero to infinity. For convenience, we denote the heuristic term employed in the $t$th iteration as $V_\pi^t(\cdot)$, and the policy generated by the MPC with $V_\pi^t(\cdot)$ as the terminal cost is denoted as $\pi^t(\cdot)$. We are now in a position to present the PI mechanism in RLMPC.
Initialization: Given an initial state $x_k \in \mathbb{X}$ at time $k$, we start the iteration with an initial heuristic term $V_\pi^0(\cdot) \equiv 0$ and a corresponding stabilizing policy $\pi^0(x_k)$, whose existence is already guaranteed by Assumptions 2 and 3. For $t = 0, 1, 2, \ldots$, perform the following two steps iteratively.
Policy Evaluation Step: Determine the value function $V_\pi^{t+1}(\cdot)$ for policy $\pi^t(\cdot)$ by following the steps presented in Section III-B: first, sample in $\mathbb{X}$ to form $S_k$. Second, generate training data $J^t(x_{k_j}, \boldsymbol{u}_{k_j}^t)$, $x_{k_j} \in S_k$, with
\begin{align*} J^{t}(x_{k_{j}}, \boldsymbol {u}_{k_{j}}^{t})=&\min _{ \boldsymbol {u}_{k_{j}} \in \mathbb {U}}\left \{{ \sum _{i=0}^{N-1} \ell (x_{i|k_{j}}, u_{i|k_{j}}) + V^{t}_{\pi }(x_{N|k_{j}}) }\right \} \\=&\sum _{i=0}^{N-1} \ell \left ({x_{i|k_{j}}^{t}, u_{i|k_{j}}^{t}}\right) + V_{\pi }^{t}\left ({x_{N|k_{j}}^{t}}\right) \tag{17}\end{align*}
where $x_{k_j} = x_{0|k_j}^t$, and $x_{i|k_j}^t$, $u_{i|k_j}^t$, $i \in \mathbb{N}_{[0,N-1]}$, are deduced from model (11c), satisfying (11d) and (11e). Third, train the network using (16) to obtain the value function
\begin{equation*} V^{t+1}_{\pi} (x_{k}) \rightarrow J^{t}(x_{k}, \boldsymbol {u}_{k}^{t})\quad \forall x_{k} \in \mathbb {X}. \tag{18}\end{equation*}
Policy Improvement Step: According to the descriptions in Section III-A, the updated policy $\pi^{t+1}(x_k)$ is yielded by solving OP 2 with terminal cost $V_\pi^{t+1}(\cdot)$
\begin{equation*} \boldsymbol {u}_{k}^{t+1}=\arg \min _{ \boldsymbol {u}^{t+1}_{k} \in \mathbb {U}} \left \{{\sum _{i=0}^{N-1} \ell (x_{i|k}, u_{i|k}) + V_{\pi }^{t+1}(x_{N|k})}\right \} \tag{19}\end{equation*}
and $\boldsymbol{u}_k^{t+1}$ is implemented in a receding-horizon manner.
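For an unconstrained scalar linear-quadratic instance, the PI loop above collapses to a recursion on a quadratic terminal-cost weight: the evaluation step (17)–(18) composes $N$ backward Riccati steps starting from the current weight, and the improvement step reuses the result as the new terminal cost. A minimal sketch under these illustrative assumptions (scalar system $x^+ = a x + b u$, $\ell = q x^2 + r u^2$; not the article's general constrained setting):

```python
# PI sketch for RLMPC on a scalar LQ problem: the value function is exactly
# quadratic, V^t(x) = P_t * x^2, so policy evaluation reduces to composing
# N backward Riccati steps starting from the current terminal weight P_t.
def riccati_step(P, a, b, q, r):
    """One-step cost-to-go update for x+ = a*x + b*u, l = q*x^2 + r*u^2."""
    return q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)

def policy_evaluation(P, a, b, q, r, N):
    """N-step MPC cost with terminal weight P, i.e., updates (17)-(18)."""
    for _ in range(N):
        P = riccati_step(P, a, b, q, r)
    return P

a, b, q, r, N = 1.2, 1.0, 1.0, 1.0, 4
P = 0.0                                        # V^0 = 0 (initialization)
history = [P]
for t in range(50):                            # the PI loop
    P = policy_evaluation(P, a, b, q, r, N)    # evaluate, then reuse as terminal cost
    history.append(P)

# P_t is monotonically nondecreasing (cf. Theorem 1) and converges to the
# stationary solution of the algebraic Riccati equation.
print(round(P, 4))
```

In this special case the fixed point of the iteration is the infinite-horizon LQ weight, consistent with the convergence results of Section IV.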
In what follows, we focus on analyzing the theoretical properties of this RLMPC scheme.
SECTION IV.
Convergence, Feasibility, and Stability Analysis
In this section, we analyze the convergence, feasibility, and stability properties. We show that for all $x_k \in \mathbb{X}$, the value function $V_\pi^t(x_k) \to V_\pi^{\ast}(x_k)$ and the control policy $\pi^t(x_k) \to \pi^{\ast}(x_k)$ as the iteration index $t \to \infty$. Also, we demonstrate that the policy obtained in each iteration is a feasible and stabilizing policy for the closed-loop system. The convergence proof is inspired by [42], which demonstrates the convergence of optimal control (unconstrained infinite-horizon optimization) under PI. This article leverages its proof idea to extend the results to RLMPC. Before proceeding, we introduce the following definition.
Definition 1:
For the RLMPC optimization problem OP 2, a control sequence $\tilde{\boldsymbol{u}}_k = \{\tilde{u}_{0|k}, \ldots, \tilde{u}_{N-1|k}\}$ is said to be feasible if $\tilde{u}_{i|k} \in \mathbb{U}$, $i \in \mathbb{N}_{[0,N-1]}$, and the resulting system trajectory $\tilde{\boldsymbol{x}}_k = \{\tilde{x}_{0|k}, \ldots, \tilde{x}_{N|k}\}$ satisfies $\tilde{x}_{i|k} \in \mathbb{X}$, $i \in \mathbb{N}_{[0,N]}$, rendering the cost function $J(\tilde{x}_k, \tilde{\boldsymbol{u}}_k)$ in the form of (11a) finite.
A. Convergence Analysis
We first present two theorems to demonstrate, respectively, that the value function $V_\pi^t(x_k)$ is uniformly monotonically nondecreasing with respect to the iteration index $t$ and that it has an upper bound. Then, based on these two theorems, a third theorem shows that $V_\pi^t(x_k)$ converges to the optimal solution as $t \to \infty$.
Theorem 1:
For all $t \in \mathbb{N}$ and $x_k \in \mathbb{X}$, let $V_\pi^t(x_k)$ and $\pi^t(x_k)$ be obtained by the PI presented in Section III-C. Then $V_\pi^t(x_k)$ is a uniformly monotonically nondecreasing sequence in $t$, that is, $\forall t: V_\pi^t(x_k) \leq V_\pi^{t+1}(x_k)$.
Proof:
Define a new value function $\Gamma^t(x_k)$ with $\Gamma^0(\cdot) \equiv 0$, subject to the update law
\begin{equation*} \Gamma ^{t}(x_{k})= \sum _{i=0}^{N-1} \ell (\tilde {x}_{i|k}, \tilde {u}_{i|k}) + \Gamma ^{t-1}(\tilde {x}_{N|k}) \tag{20}\end{equation*}
where $\tilde{\boldsymbol{u}}_k = \{\tilde{u}_{0|k}, \ldots, \tilde{u}_{N-1|k}\}$ is an arbitrary feasible control sequence as defined in Definition 1, $\tilde{x}_{0|k} = x_k$, and $\tilde{x}_{i|k}$, $i \in \mathbb{N}_{[1,N]}$, is the resulting trajectory of $\tilde{\boldsymbol{u}}_k$. Since $\Gamma^0(\cdot) = V_\pi^0(\cdot) = 0$, we have
\begin{align*} \Gamma ^{1}(x_{k})=&\sum _{i=0}^{N-1} \ell (\tilde {x}_{i|k}, \tilde {u}_{i|k}) + \Gamma ^{0}(\tilde {x}_{N|k}) \\\geq&\min _{ \boldsymbol {u}_{k} \in \mathbb {U}}\left \{{ \sum _{i=0}^{N-1} \ell (x_{i|k}, u_{i|k}) + \Gamma ^{0}(x_{N|k}) }\right \} \\=&\min _{ \boldsymbol {u}_{k} \in \mathbb {U}}\left \{{ \sum _{i=0}^{N-1} \ell (x_{i|k}, u_{i|k}) + V^{0}_{\pi }(x_{N|k}) }\right \} \\=&\sum _{i=0}^{N-1} \ell \left ({x_{i|k}^{0}, u^{0}_{i|k}}\right) + V^{0}_{\pi }\left ({x_{N|k}^{0}}\right) =V^{1}_{\pi }\left ({x_{k}}\right). \tag{21}\end{align*}
The last two equalities hold from (17) and (18), indicating the following update law:
\begin{align*} V_{\pi }^{t+1} \left ({x_{k}}\right)=&J^{t}\left ({x_{k}, \boldsymbol {u}_{k}^{t}}\right) \\=&\sum _{i=0}^{N-1} \ell \left ({x_{i|k}^{t}, u_{i|k}^{t}}\right) + V_{\pi }^{t}\left ({x_{N|k}^{t}}\right) \\=&\min _{ \boldsymbol {u}_{k} \in \mathbb {U}} \left \{{\sum _{i=0}^{N-1} \ell (x_{i|k}, u_{i|k}) + V_{\pi }^{t}(x_{N|k})}\right \}. \tag{22}\end{align*}
Assume that at $t = l$, $\Gamma^l(x_k) \geq V_\pi^l(x_k)$ holds for all $x_k \in \mathbb{X}$. Then, for $t = l+1$,
\begin{align*} \Gamma ^{l+1}(x_{k})=&\sum _{i=0}^{N-1} \ell (\tilde {x}_{i|k}, \tilde {u}_{i|k}) + \Gamma ^{l}(\tilde {x}_{N|k})\\\geq&\min _{ \boldsymbol {u}_{k} \in \mathbb {U}}\left \{{ \sum _{i=0}^{N-1} \ell (x_{i|k}, u_{i|k}) + \Gamma ^{l}(x_{N|k}) }\right \}\\=&\sum _{i=0}^{N-1} \ell \left ({\bar {x}^{l}_{i|k}, \bar {u}^{l}_{i|k}}\right) + \Gamma ^{l}\left ({\bar {x}^{l}_{N|k}}\right) \\\geq&\sum _{i=0}^{N-1} \ell \left ({\bar {x}^{l}_{i|k}, \bar {u}^{l}_{i|k}}\right) + V_{\pi }^{l}\left ({\bar {x}^{l}_{N|k}}\right) \\\geq&\min _{ \boldsymbol {u}_{k} \in \mathbb {U}}\left \{{ \sum _{i=0}^{N-1} \ell (x_{i|k}, u_{i|k}) + V^{l}_{\pi }(x_{N|k}) }\right \}\\=&\sum _{i=0}^{N-1} \ell \left ({x_{i|k}^{l}, u^{l}_{i|k}}\right) + V^{l}_{\pi }\left ({x^{l}_{N|k}}\right) =V^{l+1}_{\pi }\left ({x_{k}}\right)\end{align*}
where $\bar{u}_{i|k}^l$, $\bar{x}_{i|k}^l$, $i = 0, 1, \ldots$, are the solutions to the problem $\min_{\boldsymbol{u}_k \in \mathbb{U}} \{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + \Gamma^l(x_{N|k})\}$. Through induction, we conclude that
\begin{equation*} \forall t: \Gamma ^{t}(x_{k}) \geq V^{t}_{\pi }(x_{k}). \tag{23}\end{equation*}
Next, we use this conclusion to prove that $V_\pi^t(x_k)$ is a uniformly monotonically nondecreasing sequence in $t$. Again, mathematical induction is adopted. First, since $V_\pi^1(x_k) = \sum_{i=0}^{N-1} \ell(x^0_{i|k}, u^0_{i|k}) \geq 0$ and $\Gamma^0(x_k) = 0$, it is obvious that $\Gamma^0(x_k) \leq V_\pi^1(x_k)$. Second, if we let the arbitrary feasible policy $\tilde{\boldsymbol{u}}_k$ be $\boldsymbol{u}^t_k$ in (20), we have the following update law at $t=l$:
\begin{equation*}
\Gamma^l(x_k) = \sum_{i=0}^{N-1} \ell\left(x^l_{i|k}, u^l_{i|k}\right) + \Gamma^{l-1}\left(x^l_{N|k}\right). \tag{24}
\end{equation*}
At this time, assume that $\Gamma^{l-1}(x_k) \leq V_\pi^l(x_k)$ holds for all $x_k \in \mathbb{X}$; then, from (24) and (22), we derive
\begin{equation*}
\Gamma^l(x_k) - V_\pi^{l+1}(x_k) = \Gamma^{l-1}\left(x^l_{N|k}\right) - V_\pi^l\left(x^l_{N|k}\right) \leq 0. \tag{25}
\end{equation*}
Hence, $\Gamma^l(x_k) \leq V_\pi^{l+1}(x_k)$ holds. Therefore, we conclude that $\forall t: \Gamma^t(x_k) \leq V_\pi^{t+1}(x_k)$. Combining this with (23), we can draw the conclusion that
\begin{equation*}
\forall t: V_\pi^t(x_k) \leq \Gamma^t(x_k) \leq V_\pi^{t+1}(x_k) \tag{26}
\end{equation*}
which completes the proof.
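Theorem 1's monotonicity can be checked numerically on a toy problem. For an unconstrained scalar linear-quadratic instance, the $N$-step minimization in (22) reduces to backward dynamic programming, i.e., $N$ Riccati recursion steps applied to the coefficient of the current terminal cost, so each PI iteration composes $N$ Riccati updates. A minimal Python sketch (the scalar system and weights below are illustrative assumptions, not values from this article):

```python
# Scalar LQ instance: x_{k+1} = a*x_k + b*u_k, running cost q*x^2 + r*u^2.
a, b, q, r, N = 1.1, 1.0, 1.0, 0.5, 2

def riccati_step(p):
    # One backward DP step for a quadratic value V(x) = p*x^2:
    # min_u { q*x^2 + r*u^2 + p*(a*x + b*u)^2 } is again quadratic in x.
    return q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)

# Policy iteration with horizon N, starting from V^0 = 0 (zero terminal cost):
# the coefficient of V^{t+1} is N Riccati steps applied to that of V^t.
p = [0.0]
for t in range(30):
    pt = p[-1]
    for _ in range(N):
        pt = riccati_step(pt)
    p.append(pt)

print(p[:4])  # coefficients p_t of V^t(x) = p_t*x^2, nondecreasing in t
```

The sequence $p_t$ increases monotonically and converges to the fixed point of the Riccati map, mirroring (26) and the convergence claimed later in Theorem 3.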
Since $V_\pi^0(\cdot) = 0$ and the running cost (4) is positive definite, Theorem 1 indicates that the value function $V_\pi^t(x_k)$ obtained in any iteration $t \in \mathbb{N}_{\geq 1}$ is positive definite.
Theorem 2:
For all $t \in \mathbb{N}$ and $x_k \in \mathbb{X}$, let $V_\pi^t(x_k)$ and $\pi^t(x_k)$ be obtained by the PI presented in Section III-C, and let $V_\pi^*(x_k)$ be the optimal value function satisfying the Bellman optimality equation
\begin{equation*}
V_\pi^*(x_k) = \min_{\boldsymbol{u}_k \in \mathbb{U}} \left\{ \sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V_\pi^*(x_{N|k}) \right\} \tag{27}
\end{equation*}
then $V_\pi^*(x_k)$ is an upper bound of $V_\pi^t(x_k)$, that is,
\begin{equation*}
\forall t: V_\pi^t(x_k) \leq V_\pi^*(x_k). \tag{28}
\end{equation*}
Proof:
We prove this by induction. First, at $t=0$, $V_\pi^0(x_k) = 0 \leq V_\pi^*(x_k)$. Second, assume that at $t=l$, $V_\pi^l(x_k) \leq V_\pi^*(x_k)$ holds for all $x_k \in \mathbb{X}$; then, at $t=l+1$, we have
\begin{align*}
V_\pi^{l+1}(x_k) &= \min_{\boldsymbol{u}_k \in \mathbb{U}} \left\{ \sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V_\pi^l(x_{N|k}) \right\} \\
&\leq \min_{\boldsymbol{u}_k \in \mathbb{U}} \left\{ \sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V_\pi^*(x_{N|k}) \right\} \\
&= V_\pi^*(x_k). \tag{29}
\end{align*}
Thus, $V_\pi^t(x_k) \leq V_\pi^*(x_k)$ holds for all $t \in \mathbb{N}$, and $V_\pi^*(x_k)$ is an upper bound. The proof is completed.
Based on the properties revealed by Theorems 1 and 2, we give the following theorem to show the convergence of RLMPC.
Theorem 3:
For all $t \in \mathbb{N}$ and $x_k \in \mathbb{X}$, let $V_\pi^t(x_k)$ and $\pi^t(x_k)$ be obtained by the PI presented in Section III-C. As $t \to \infty$, $\lim_{t\to\infty} V_\pi^t(x_k) = V_\pi^\infty(x_k) = V_\pi^*(x_k)$ and $\lim_{t\to\infty} \pi^t(x_k) = \pi^\infty(x_k) = \pi^*(x_k)$.
Proof:
From Theorems 1 and 2, we know that $V_\pi^t(x_k)$ is a uniformly monotonically nondecreasing and upper-bounded sequence in $t$. According to the monotone convergence theorem [43], $V_\pi^t(x_k)$ converges to a limit, denoted by $V_\pi^\infty(x_k)$, as $t \to \infty$. Since $\lim_{t\to\infty} V_\pi^t(x_k) = \lim_{t\to\infty} V_\pi^{t+1}(x_k) = V_\pi^\infty(x_k)$, we obtain
\begin{align*}
V_\pi^\infty(x_k) &= \lim_{t\to\infty} V_\pi^{t+1}(x_k) \\
&= \lim_{t\to\infty} \min_{\boldsymbol{u}_k \in \mathbb{U}} \left\{ \sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V_\pi^t(x_{N|k}) \right\} \\
&= \min_{\boldsymbol{u}_k \in \mathbb{U}} \left\{ \sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + \lim_{t\to\infty} V_\pi^t(x_{N|k}) \right\} \\
&= \min_{\boldsymbol{u}_k \in \mathbb{U}} \left\{ \sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + \lim_{t\to\infty} V_\pi^{t+1}(x_{N|k}) \right\} \\
&= \sum_{i=0}^{N-1} \ell\left(x^\infty_{i|k}, u^\infty_{i|k}\right) + V_\pi^\infty\left(x^\infty_{N|k}\right)
\end{align*}
which is the Bellman optimality equation. Thus, $V_\pi^\infty(x_k) = V_\pi^*(x_k)$, and the resulting policy satisfies $\lim_{t\to\infty} \pi^t(x_k) = \pi^\infty(x_k) = \pi^*(x_k)$. The proof is completed.
B. Stability and Feasibility Analysis
Theorem 4:
For all $t \in \mathbb{N}$ and $x_k \in \mathbb{X}$, let $V_\pi^t(x_k)$ and $\pi^t(x_k)$ be obtained by the PI presented in Section III-C. Then, for every $t \in \mathbb{N}$, the RLMPC policy $\pi^t(x_k)$ is feasible, and the closed-loop system is asymptotically stable under $\pi^t(x_k)$.
Proof:
For $t=0$, Assumption 3 has already ensured the asymptotic stability and feasibility of the initial policy $\pi^0(x_k)$ for all $x_k \in \mathbb{X}$. Before proceeding, we define
\begin{equation*}
V_{\tilde{N}}(x_k) = \min_{\boldsymbol{u}_{\tilde{N}\parallel k} \in \mathbb{U}} \sum_{i=0}^{\tilde{N}-1} \ell(x_{i|k}, u_{i|k}) \tag{30}
\end{equation*}
subject to (11b)–(11e) with $N$ replaced by $\tilde{N}$, where $\tilde{N}$ can be an arbitrary positive integer and $\boldsymbol{u}_{\tilde{N}\parallel k} = \{u_{0|k}, u_{1|k}, \ldots, u_{\tilde{N}-1|k}\}$ is the decision variable of length $\tilde{N}$. By letting $\tilde{N} = N$, it is obvious that $V_N(x_k) = J^0(x_k, \boldsymbol{u}^0_{N\parallel k}) = V_\pi^1(x_k)$ holds for all $x_k \in \mathbb{X}$. Furthermore, we can also obtain
\begin{align*}
V_{2N}(x_k) &= \min_{\boldsymbol{u}_{2N\parallel k} \in \mathbb{U}} \sum_{i=0}^{2N-1} \ell(x_{i|k}, u_{i|k}) \\
&= \min_{\boldsymbol{u}_{2N\parallel k} \in \mathbb{U}} \left\{ \sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + \sum_{i=N}^{2N-1} \ell(x_{i|k}, u_{i|k}) \right\} \\
&= \min_{\boldsymbol{u}_{N\parallel k} \in \mathbb{U}} \left\{ \sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V_N(x_{N|k}) \right\} \tag{31}
\end{align*}
where the last equation holds because of Bellman's principle of optimality. As a result, invoking (22) leads to
\begin{align*}
V_{2N}(x_k) &= \min_{\boldsymbol{u}_{N\parallel k} \in \mathbb{U}} \left\{ \sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V_\pi^1(x_{N|k}) \right\} \\
&= J^1(x_k, \boldsymbol{u}^1_{N\parallel k}) = V_\pi^2(x_k). \tag{32}
\end{align*}
This illustrates that the policy generated by RLMPC with $V_\pi^1(x_{N|k})$ as the terminal cost is equivalent to that of the MPCWTC with a prediction horizon of length $2N$. By induction, it is easy to verify that
\begin{equation*}
V_{(t+1)\cdot N}(x_k) = J^t(x_k, \boldsymbol{u}^t_{N\parallel k}) = V_\pi^{t+1}(x_k) \quad \forall t \in \mathbb{N} \tag{33}
\end{equation*}
which implies that the policy $\pi^t(x_k)$ is equivalent to the MPCWTC with prediction horizon $(t+1)\cdot N$. Given that $N$ satisfies (13), $(t+1)\cdot N \geq 2\ln\beta/[\ln\beta - \ln(\beta-1)]$ also holds, so the asymptotic stability and feasibility of $\pi^t(x_k)$ are guaranteed by Lemma 1. The proof is completed.
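The equivalence (32) can be verified numerically on a toy problem. With a finite input set, each minimization becomes an exact search over input sequences, so the MPCWTC value with horizon $2N$ and one PI step with $V_\pi^1$ as terminal cost must coincide. A short Python sketch (the scalar dynamics, cost, and input grid are illustrative assumptions, not from this article):

```python
# Toy setup with a finite input set, so each minimization is an exact tree search.
U = [-1.0, -0.5, 0.0, 0.5, 1.0]
f = lambda x, u: 1.1 * x + u             # scalar dynamics (illustrative)
ell = lambda x, u: x * x + 0.5 * u * u   # running cost

def V(x, n, terminal=lambda x: 0.0):
    # n-step optimal cost-to-go under the given terminal cost.
    if n == 0:
        return terminal(x)
    return min(ell(x, u) + V(f(x, u), n - 1, terminal) for u in U)

N, x0 = 2, 1.7
V1 = lambda x: V(x, N)    # V^1_pi: horizon N with zero terminal cost
V2 = V(x0, N, V1)         # one more PI step, with V^1_pi as terminal cost
V_2N = V(x0, 2 * N)       # MPCWTC with prediction horizon 2N
print(V2, V_2N)           # equal, as (32) predicts
```

The same recursion also exhibits $V_\pi^1(x_0) \leq V_\pi^2(x_0)$, consistent with Theorem 1.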
SECTION VI.
Simulation Examples
To evaluate the effectiveness of the proposed RLMPC scheme, two examples are studied: one involving a linear system and the other a nonlinear one. Traditional MPC (i.e., the scheme described in OP 1) can attain linear-quadratic optimal performance for linear systems with properly designed terminal conditions, so the linear example serves as a benchmark for RLMPC performance. For nonlinear systems, however, traditional MPC achieves only suboptimal control performance, which is where RLMPC comes in; we highlight the superiority of RLMPC in the nonlinear example. In both examples, we run RLMPC (following Algorithm 1) for two episodes. The first episode (labeled "RLMPC Epi. 1"), called the learning episode, learns from scratch, starting from the initial policy to control the system; the second episode (labeled "RLMPC Epi. 2"), also referred to as the re-execution episode, redoes the task under the learned value function (LVF) to check the performance.
A. Linear System
Consider the following linear system:
\begin{equation*}
x_{k+1} = A x_k + B u_k \tag{34}
\end{equation*}
where $x_k = [x_{1k}, x_{2k}]^T \in \mathbb{R}^2$, $u_k \in \mathbb{R}$, $A = \begin{bmatrix} 1 & 0.5 \\ -0.1 & 0.9 \end{bmatrix}$, and $B = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$. We set the running cost to be $\ell(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$ with weights $Q = I \in \mathbb{R}^{2\times 2}$ and $R = 0.5$. Then, following the traditional MPC design procedure [9], we solve a Riccati equation and obtain $P = \begin{bmatrix} 1.4388 & -0.3409 \\ -0.3409 & 5.0793 \end{bmatrix}$, which gives the terminal cost $F(x_{N|k}) = x_{N|k}^T P x_{N|k}$ and the terminal controller $u_k = K x_k$ with $K = [-0.7597, -0.2128]$ being the linear-quadratic optimal gain. Correspondingly, we set the RLMPC parameters to $\alpha = 10^{-6}$, $\varepsilon = 10^{-8}$, $p = 9$, and $\Phi(x) = [x_1, x_2, x_1^2, x_2^2, x_1 x_2, x_1^3, x_1^2 x_2, x_2^3, x_1 x_2^2]$. Let the prediction horizon be $N = 3$ and the initial state be $x_0 = [2.9, 2]^T$. In each iteration, we collect 100 random samples in the state space $x_{1k} \in [-0.5, 3.5]$, $x_{2k} \in [-0.5, 2.5]$, together with the actual system trajectory, to form $S_k$. Our task is to drive the system states from $x_0$ to the origin while minimizing (3). We examine the performance of RLMPC and compare it with those of traditional MPC and MPCWTC (labeled "LQR (MPC)" and "MPCWTC," respectively, in the following figures). To facilitate comparison with the infinite-horizon optimal control performance, we impose no constraints on this control problem, in which case the traditional MPC designed above is fully equivalent to LQR.
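The terminal ingredients quoted above can be reproduced by iterating the discrete algebraic Riccati equation to its fixed point. A NumPy sketch (the iteration count and the identity initialization are incidental choices):

```python
import numpy as np

# Linear example (34): system matrices and stage-cost weights.
A = np.array([[1.0, 0.5], [-0.1, 0.9]])
B = np.array([[1.0], [0.0]])
Q = np.eye(2)
R = np.array([[0.5]])

# Fixed-point iteration of the discrete algebraic Riccati equation
# P = Q + A'PA - A'PB (R + B'PB)^{-1} B'PA.
P = np.eye(2)
for _ in range(500):
    S = R + B.T @ P @ B
    P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(S, B.T @ P @ A)

# Optimal gain for the terminal controller u_k = K x_k.
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print(np.round(P, 4))  # ≈ [[1.4388, -0.3409], [-0.3409, 5.0793]]
print(np.round(K, 4))  # ≈ [[-0.7597, -0.2128]]
```

The closed-loop matrix $A + BK$ is Schur stable, so $F(x) = x^T P x$ qualifies as a terminal cost.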
The relevant state trajectories and control inputs are reported in Fig. 2. The trajectories of "RLMPC Epi. 1" and "MPCWTC" are very close in the initial few steps, because the initial policy of the learning episode is essentially the MPCWTC. As the policy of the learning episode improves with each control step $k$, its trajectory gradually deviates from that of "MPCWTC" and converges to that of "LQR (MPC)." The evolution curves of the weight vector $W$ in the learning episode are shown in Fig. 3. We observe that $W$ converges at the 11th step (at which moment the system states are not yet regulated to the origin), indicating the convergence of the learning process. Since Algorithm 1 performs one iteration at each step, the learning process converges at the 11th iteration. To examine optimality, we compare the performance of "LQR (MPC)" with that of "RLMPC Epi. 2." Their trajectories essentially coincide within a small error, which verifies the optimality of RLMPC; the error depends mainly on the selection of $\varepsilon$ for terminating the iterations.
Fig. 4 illustrates the accumulated cost (ACC), defined as the sum of the running costs from the initial time to the current time $k$:
\begin{equation*}
\text{ACC} = \sum_{i=0}^{k} \ell(x_i, u_i). \tag{35}
\end{equation*}
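As a concrete instance of (35), the ACC of the LQR closed loop can be accumulated along a simulated trajectory; a NumPy sketch using the gain reported for this example (the 80-step cutoff is an arbitrary choice):

```python
import numpy as np

A = np.array([[1.0, 0.5], [-0.1, 0.9]])
B = np.array([[1.0], [0.0]])
Q, R = np.eye(2), 0.5
K = np.array([[-0.7597, -0.2128]])  # LQR gain of the linear example

x = np.array([[2.9], [2.0]])        # initial state x_0
acc = 0.0
for k in range(80):
    u = K @ x
    acc += (x.T @ Q @ x + R * u**2).item()  # running cost l(x_k, u_k)
    x = A @ x + B @ u                        # system (34)

print(round(acc, 4))  # ACC of the LQR closed loop after 80 steps
```

Since the gain is stabilizing, the ACC approaches the infinite-horizon cost $x_0^T P x_0 \approx 28.46$ as $k$ grows.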
As expected, the ACC of "MPCWTC" is the highest because it only looks $N$ steps ahead and ignores all the residual costs in each control period. The ACC of "RLMPC Epi. 1" ranks second highest, as its policy gradually improves from "MPCWTC." "RLMPC Epi. 2" achieves almost the same minimum as "LQR (MPC)"; the slightly higher ACC (about $4\times10^{-5}$) is mainly due to the precision setting for terminating the iterations. In conclusion, the proposed RLMPC approach can deliver the (near-)optimal policy in the infinite-horizon sense for the control of a linear system.
To verify the statements in Remark 6, we first test the equivalence of RLMPC and the MPCWTC with an accumulating prediction horizon. Since the prediction horizon of RLMPC is set to $N = 3$, following Algorithm 1 and the results in Section IV-B, it is theoretically equivalent to the MPCWTC with $N = 3k$, where $k \in \mathbb{N}_{>0}$ is the step index. The two upper subgraphs of Fig. 5 show that their inputs, and hence their trajectories, are basically the same, which confirms this equivalence. We then investigate the effect of increasing the prediction horizon by comparing the performance of RLMPC in the learning episode with $N = 3, 5, 7, 9, 11$, respectively. As can be seen from the lower subgraph of Fig. 5, the larger the prediction horizon, the closer the corresponding trajectory is to the optimal trajectory "LQR (MPC)." This is because the equivalent MPCWTC has an accumulating prediction horizon that grows with every iteration, so the policy at each step is closer to the optimal one. Furthermore, we present the ACCs and the CRs (measured by the number of iterations required for convergence) in Table I. A larger prediction horizon clearly brings a lower ACC and faster convergence in the learning episode, which is consistent with our theoretical results.
We finally investigate the data efficiency of the proposed scheme and compare it with Q-learning [44] and PILCO [45]. The former employs the action-value function and has a learning process very similar to that of the proposed scheme, except that it is model-free. The latter is a model-based policy search scheme well known for its data efficiency. Implementing Q-learning and PILCO for the optimal control of (34), we observe that the amounts of training data required for them to converge to the optimal policy are $1.3\times10^4$ and 400 (10 trials with 40 training data each), respectively. In contrast, RLMPC needs only 341 training data, far less than Q-learning, and achieves efficiency comparable to that of PILCO. This is mainly because Q-learning uses no prior knowledge of the system and therefore requires a large amount of interaction data to provide a basis for policy improvement, whereas PILCO and RLMPC optimize the policy with the assistance of the model, significantly reducing the number of interactions and accelerating convergence. In addition, owing to the combination with MPC, the significant advantage of RLMPC over PILCO is the theoretical guarantee of stability and feasibility for the policies generated throughout the learning process.
B. Nonlinear System
Consider the regulation problem of a nonholonomic vehicle system
\begin{equation*}
x_{k+1} = x_k + g(x_k) u_k \tag{36}
\end{equation*}
where $x_k = [\chi_k, y_k, \theta_k]^T$ is the state vector, with $\chi_k$, $y_k$, and $\theta_k$ representing the abscissa, ordinate, and yaw angle in the geodetic coordinate system, respectively; $u_k = [v_k, \omega_k]^T$ is the control input, with $v_k$ and $\omega_k$ representing the linear and angular velocities, respectively; $\delta$ is the sampling interval; and
\begin{equation*}
g(x_k) = \delta \begin{bmatrix} \cos\theta_k & 0 \\ \sin\theta_k & 0 \\ 0 & 1 \end{bmatrix}.
\end{equation*}
We set $\delta = 0.2$ s, the input constraints $|v_k| \leq 1$ m/s and $|\omega_k| \leq 4$ rad/s, and the state constraint $0 \leq \chi \leq 2$ m. The control objective is to drive the system from $x_0 = [1.98, 5, -\pi/3]^T$ to the origin while minimizing (3) with the running cost in quadratic form $\ell(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$, where $Q = \mathrm{diag}\{1, 2, 0.06\}$ and $R = \mathrm{diag}\{0.01, 0.005\}$.
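The discrete-time model (36) is straightforward to simulate; a minimal Python sketch of one step with the stated input saturation (the sample input below is an arbitrary illustrative choice):

```python
import math

DELTA = 0.2  # sampling interval, delta = 0.2 s

def step(x, u):
    """One step of the vehicle model (36): x_{k+1} = x_k + g(x_k) u_k."""
    chi, y, theta = x
    v, omega = u
    # Input saturation: |v| <= 1 m/s, |omega| <= 4 rad/s.
    v = max(-1.0, min(1.0, v))
    omega = max(-4.0, min(4.0, omega))
    return (chi + DELTA * v * math.cos(theta),
            y + DELTA * v * math.sin(theta),
            theta + DELTA * omega)

x0 = (1.98, 5.0, -math.pi / 3)  # initial state of the regulation task
x1 = step(x0, (0.1, 0.5))       # an arbitrary admissible input
print(x1)
```

Note that the state constraint $0 \leq \chi \leq 2$ m is enforced by the optimizer, not by this dynamics function.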
1) Design of the Traditional MPC and RLMPC:
In [47], a traditional MPC is proposed for (36): the terminal cost is $F(x_{N|k}) = \chi^2_{N|k} + y^2_{N|k} + \theta^2_{N|k}$, and the terminal set $\mathbb{X}_\Omega$ is defined as $\mathbb{X}_\Omega = \{x_{N|k} \in \mathbb{X} : F(x_{N|k}) \leq \varrho_v\}$, where $\varrho_v = c\varrho$, $\varrho = \min\{\bar{v}^2/\eta^2, \bar{\omega}^2/\xi^2\}$, $c = (1 + T^2\max\{\eta^2, \xi^2\} - \max\{\eta T^2 + q_{11}/\eta + r_{11}\eta, 2\xi T\}) < 1$, $\bar{v}$ and $\bar{\omega}$ are the upper bounds of $v$ and $\omega$, respectively (that is, $\bar{v} = 1$ and $\bar{\omega} = 4$), and $q_{11}$ and $r_{11}$ are the first diagonal elements of $Q$ and $R$, respectively. By choosing $\eta = \xi = 8$ and $N \geq 30$, system (36) can be stabilized to the origin.
This traditional MPC design is computationally demanding because the prediction horizon should be at least 30 to meet the terminal constraint, which confirms Remark 9. Moreover, there is no guarantee that the control performance of the designed MPC is optimal. In contrast, RLMPC avoids artificially designing the terminal conditions and achieves (near-)optimal performance. With the learned (near-)optimal terminal cost, its prediction horizon can be flexibly adjusted without affecting its performance. Therefore, it is expected to show unique superiority in this problem.
Now we present the RLMPC design: $\alpha = 10^{-6}$, $\varepsilon = 10^{-7}$, $p = 30$, and the basis functions are chosen as polynomials of orders one to six. In each iteration, we generate 100 random samples in the state space $\chi_k \in [0, 2]$, $y_k \in [-1, 6]$, $\theta_k \in [-\pi, \pi]$, together with the actual system trajectory, to form $S_k$.
2) Comparison of the Traditional MPC and RLMPC:
In the following simulations, the prediction horizons of the traditional MPC and RLMPC are set to 30 and 5 steps, respectively.
As can be seen from Figs. 6 and 7, no constraint is violated during either episode, which demonstrates that RLMPC is able to restrict the policy within the input and state constraints and thus operate safely. Fig. 8 illustrates the convergence of the weight $W$; we observe that it takes 25 iterations to converge, and the evolution of the LVF is also visualized. Consistent with Theorem 1, this evolution exhibits the uniformly monotonically nondecreasing property. To verify Theorem 4, we apply each of the policies obtained in the first 25 iterations of the learning episode to the system and show the corresponding trajectories and ACCs [as defined in (35)] in Fig. 9. Each policy generates a safe and stabilizing trajectory to the origin, and the ACCs improve gradually, which also corroborates Theorem 1. The comparison of the two RLMPC episodes and traditional MPC is shown in Fig. 7. Since the traditional MPC design for nonlinear systems does not ensure optimality, RLMPC achieves better performance (15.09% lower ACC) even in the learning episode. There is a further 16% reduction in ACC in the re-execution episode, where a near-optimal policy is employed.
3) Prediction Horizon and Computational Burden:
We now examine the relationship between the prediction horizon, control performance, convergence, and computational burden, using three metrics for quantitative evaluation: ACC, CR, and the average computational time (ACT) per step. RLMPC in the learning episode with four different prediction horizons, namely $N = 5, 7, 9, 30$, is compared (other parameter settings remain unchanged), along with traditional MPC with $N = 30$. The corresponding numerical results are listed in Table II.
Consistent with the linear-system example, increasing the prediction horizon of RLMPC effectively improves the CR and yields an improved (lower-ACC) trajectory in the learning episode. Conceivably, this comes at the cost of increased computation time at each step. However, even with the same prediction horizon of 30, the ACT of RLMPC is still slightly lower than that of traditional MPC, mainly owing to the removal of the terminal constraint, which simplifies the optimization problem.
We now verify Remark 7. Taking the learned (near-)optimal value function as the terminal cost of RLMPC, we reduce the prediction horizon to 1 and re-execute the regulation task. As can be seen from Fig. 10, the trajectories and input under $N = 1$ basically coincide with those under $N = 5$, which confirms that shortening the prediction horizon does not affect the (near-)optimal performance. More importantly, the ACT is reduced from 0.0826 to 0.0222 s (an improvement of about 73%), which illustrates the effectiveness of RLMPC in reducing the computational burden.